In this lesson, we’ll learn how TrueTime guarantees correctness properties around concurrency control and how those properties ensure the following:

1. Externally consistent transactions

2. Reads from the past that do not block any ongoing transaction

3. Read-only transactions that do not require any lock

These features make it possible for an audit read of the entire database at any timestamp t to see exactly the effects of every transaction committed as of t.

We will use the terms Paxos write and Spanner client write to distinguish between the two kinds of writes.

Timestamp management

The Spanner implementation supports read-write transactions, read-only transactions, and snapshot reads. Standalone writes are implemented as read-write transactions, while non-snapshot standalone reads are implemented as read-only transactions. The following table shows the operation types Spanner supports and their respective concurrency control.

Operation Types and Their Concurrency Control

| Operation | Concurrency Control |
| --- | --- |
| Read-Write Transaction | Pessimistic |
| Read-Only Transaction | Lock-free |
| Snapshot Read, client-provided timestamp | Lock-free |
| Snapshot Read, client-provided bound | Lock-free |

Read-only transactions

Read-only transactions take advantage of snapshot isolation. A read-only transaction must be pre-declared as having no writes; it is not simply a read-write transaction that happens to contain no writes. Spanner executes the reads in a read-only transaction without locking, at a timestamp t chosen by the system. Executing the reads at t avoids blocking incoming writes, and the reads may be served by any replica that is sufficiently up-to-date.

A snapshot read is a read from the past that does not acquire locks. When performing a snapshot read, a client can either provide a timestamp itself or let the system choose one based on a client-provided upper bound on staleness. In either case, a snapshot read can be executed on any sufficiently up-to-date replica.
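As a concrete mental model, the TrueTime interface that underpins these timestamps can be sketched in a few lines of Python. This is an illustrative model only: the EPSILON bound and the use of the local clock are assumptions, not how TrueTime is actually implemented (it relies on GPS and atomic clock references).

```python
import time
from dataclasses import dataclass

EPSILON = 0.007  # assumed clock uncertainty bound (7 ms); illustrative only

@dataclass
class TTInterval:
    earliest: float
    latest: float

def tt_now() -> TTInterval:
    """Return an interval assumed to contain the absolute time."""
    t = time.time()
    return TTInterval(earliest=t - EPSILON, latest=t + EPSILON)

def tt_after(t: float) -> bool:
    """True if timestamp t has definitely passed."""
    return tt_now().earliest > t

def tt_before(t: float) -> bool:
    """True if timestamp t has definitely not arrived yet."""
    return tt_now().latest < t
```

The rest of this lesson uses only these three operations: TT.now(), TT.after(t), and TT.before(t).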

Snapshot reads do not require locks

Once a timestamp has been chosen, the commit is inevitable for both snapshot reads and read-only transactions, unless the data at that timestamp has been garbage collected. Consequently, clients can avoid buffering results inside a retry loop. If a server fails, clients can continue the read on another server by resending the same timestamp and read position.
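Because results at a fixed timestamp are identical on every sufficiently up-to-date replica, the retry logic is trivial. The sketch below is hypothetical; `read_at` and `ServerDown` are illustrative stand-ins, not Spanner APIs.

```python
# Hypothetical sketch: a snapshot read at a fixed timestamp t is
# repeatable, so a client can reissue the same (key, t) request to
# another replica if a server goes down.

class ServerDown(Exception):
    pass

def snapshot_read(replicas, key, t):
    for replica in replicas:
        try:
            # The same timestamp t is sent on every attempt, so every
            # sufficiently up-to-date replica returns identical results.
            return replica.read_at(key, t)
        except ServerDown:
            continue
    raise RuntimeError("no replica available")
```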

Paxos leader leases

Our Paxos implementation uses timed leases (by default, 10 seconds long) to make leadership durable. To become a leader, a node sends requests for timed lease votes to the other participants; once it holds a quorum of lease votes, it holds the leader lease.
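The quorum-of-votes rule can be sketched as follows; `LeaseTracker` and its bookkeeping are illustrative assumptions, not Spanner's implementation.

```python
LEASE_SECONDS = 10.0  # default lease length from the text

class LeaseTracker:
    """Illustrative bookkeeping for timed lease votes (not Spanner's code)."""

    def __init__(self, num_participants):
        self.num_participants = num_participants
        self.vote_expiry = {}  # participant id -> vote expiry time

    def record_vote(self, participant, now):
        # Each granted vote is valid for LEASE_SECONDS from `now`.
        self.vote_expiry[participant] = now + LEASE_SECONDS

    def has_lease(self, now):
        # The node holds the leader lease while a majority of votes
        # are still unexpired.
        live = sum(1 for exp in self.vote_expiry.values() if exp > now)
        return live > self.num_participants // 2
```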

For leader election, this Paxos implementation uses the Bully algorithm. When no node is currently proposing itself as leader, any node can attempt to become one. A candidate node sends its server ID to all peers in the cluster, and each peer responds with its own server ID. If all responses carry smaller IDs than the candidate's, the candidate becomes the leader.

The following steps illustrate the election:

1. These are nodes in a cluster; no leader has been selected yet.
2. Node A wants to be a leader.
3. Node A sends its server ID to the other participants in the cluster.
4. Node B and Node C check whether their server IDs are less than the received server ID.
5. Node B and Node C return their server IDs to Node A.
6. Node A becomes the leader.

If a peer receives the ID of a node that wants to become the leader, and the peer's own ID is greater than the received ID, then instead of returning its ID, the peer starts an election of its own. The Bully algorithm avoids livelock issues.
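This bully-style election can be sketched compactly, under the assumption that server IDs are unique; `elect` is an illustrative helper that collapses the message passing into direct recursion.

```python
def elect(candidate_id, peer_ids):
    """Return the server ID that ends up as leader in a bully-style election."""
    higher = [p for p in peer_ids if p > candidate_id]
    if not higher:
        # Every peer responded with a smaller ID: the candidate wins.
        return candidate_id
    # A peer with a greater ID starts its own election instead of replying.
    new_candidate = min(higher)
    remaining = [p for p in peer_ids + [candidate_id] if p != new_candidate]
    return elect(new_candidate, remaining)
```

However the election starts, it always terminates with the highest ID in the cluster as leader, which is why livelock is avoided.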

The following steps show an election in which a peer with a greater server ID takes over:

1. These are nodes in a cluster; no leader has been selected yet.
2. Node A wants to be a leader.
3. Node A sends its server ID to the other participants in the cluster.
4. Node B and Node C check whether their server IDs are less than the received server ID.
5. Since Node B's server ID is greater than Node A's, Node B sends its server ID to the other participants, while Node C returns its server ID to Node A.
6. Node A and Node C check whether their server IDs are less than the received server ID.
7. Node A and Node C return their server IDs to Node B.
8. Node B becomes the leader.

After a successful write, a replica's lease vote is automatically extended, and the leader requests lease-vote extensions when its lease is about to expire. The leader's lease interval starts when it receives a quorum of lease votes and ends when it no longer holds a quorum. Within a Paxos group, the Paxos leader can change over time for various reasons. Spanner enforces a disjointness invariant: over time, the lease intervals of all leaders of a group are disjoint.

Spanner's implementation allows a Paxos leader to abdicate by releasing its participants' lease votes. To maintain the disjointness invariant, Spanner limits the conditions under which abdication is allowed:

  • Let s_max be the maximum timestamp used by the Paxos leader.

  • Before abdicating, the Paxos leader must wait until TT.after(s_max) is true.
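The abdication rule amounts to a simple wait loop. In the sketch below, `tt_after` is a stand-in for TT.after built on a local clock with an assumed uncertainty EPSILON, and `release_votes` is a hypothetical callback.

```python
import time

EPSILON = 0.007  # assumed clock uncertainty; illustrative only

def tt_after(t):
    # Stand-in for TT.after(t): true once t has definitely passed.
    return time.time() - EPSILON > t

def abdicate(s_max, release_votes):
    # The leader may release the participants' lease votes only after
    # s_max has definitely passed, preserving disjointness.
    while not tt_after(s_max):
        time.sleep(0.001)
    release_votes()
```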

The abdication of a Paxos leader proceeds as follows:

1. s_max is defined as the maximum timestamp used by the Paxos leader.
2. The Paxos leader queries TrueTime using the TT.after function, which returns true if s_max has definitely passed.
3. TrueTime returns false if s_max has not passed.
4. The Paxos leader has to wait until TrueTime returns true.
5. The Paxos leader queries TrueTime using the TT.after function again.
6. TrueTime returns true once s_max has passed.
7. The Paxos leader has successfully abdicated and now becomes a participant.

For this to work, the disjointness invariant must hold within each Paxos group: the lease intervals of successive leaders of the same group must be disjoint from one another.

Read-write transactions

Read-write transactions use two-phase locking. Consequently, a transaction's timestamp can be assigned at any time after all locks have been acquired and before any lock is released. Spanner enforces the following invariants on the timestamps it assigns:

  1. The disjointness invariant

  2. The external consistency invariant

Within each Paxos group, Spanner assigns timestamps to Paxos writes in monotonically increasing order, and this ordering holds across leaders as well. Assigning timestamps in monotonically increasing order is simple for a single leader replica. Across leaders, the disjointness invariant does the work: a leader may assign timestamps only within its leader lease, and whenever a timestamp s is assigned, s_max is advanced to s to preserve disjointness.
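The per-leader rules above (strictly increasing timestamps, assigned only within the lease, with s_max advanced on every assignment) can be sketched as follows; `LeaderClock`, the minimal tick, and `lease_expiry` are illustrative assumptions, not Spanner's implementation.

```python
class LeaderClock:
    """Illustrative per-leader timestamp assignment (not Spanner's code)."""

    TICK = 1e-6  # assumed minimal timestamp increment

    def __init__(self):
        self.s_max = 0.0  # maximum timestamp handed out so far

    def assign(self, now_latest, lease_expiry):
        # Timestamps are strictly increasing and at least TT.now().latest.
        s = max(now_latest, self.s_max + self.TICK)
        # A leader may assign timestamps only within its leader lease.
        if s > lease_expiry:
            raise RuntimeError("cannot assign a timestamp outside the lease")
        self.s_max = s  # advance s_max to preserve disjointness
        return s
```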

The external consistency invariant states that if a transaction T2 starts after T1 commits, then T2's commit timestamp must be greater than T1's commit timestamp. For a transaction T_i, define the following events:

  • Start of T_i: e_i^start

  • Commit of T_i: e_i^commit

  • Commit timestamp of T_i: s_i

Let t_abs(e) denote the absolute time of event e. The invariant then becomes: if t_abs(e_1^commit) < t_abs(e_2^start), then s_1 < s_2.

For example:

1. T1 commits in the database with timestamp s1.
2. T2 commits in the database with timestamp s2.
3. T1 committed before T2, so s2 will be greater than s1.

Spanner assigns timestamps to transactions using the following protocol, which ensures the invariant holds. Let e_i^server be the arrival event of the commit request at the coordinator leader for a write transaction T_i.

  1. Start: The coordinator leader assigns T_i a commit timestamp s_i that is no less than TT.now().latest, computed after e_i^server (recall that TT.now() returns an interval [earliest, latest]).

  2. Commit wait: The coordinator leader waits until TT.after(s_i) is true, i.e., until TT.now().earliest > s_i. Only after this wait can clients see the committed data. The commit wait ensures that s_i < t_abs(e_i^commit) for T_i.
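Putting the two rules together, a sketch of the commit path might look like this; `tt_now` models TT.now() on a local clock with an assumed uncertainty EPSILON, and `apply_writes`/`release_locks` are hypothetical callbacks, not Spanner APIs.

```python
import time

EPSILON = 0.007  # assumed clock uncertainty; illustrative only

def tt_now():
    # Stand-in for TT.now(): (earliest, latest) around the local clock.
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def commit(apply_writes, release_locks):
    # Start: pick s_i no less than TT.now().latest at the server event.
    s_i = tt_now()[1]
    apply_writes(s_i)
    # Commit wait: hold the result until TT.now().earliest > s_i, i.e.,
    # until s_i has definitely passed, guaranteeing s_i < t_abs(e_i^commit).
    while tt_now()[0] <= s_i:
        time.sleep(0.001)
    release_locks()
    return s_i
```

Note that the wait spans roughly two uncertainty intervals: s_i starts one EPSILON in the future, and the earliest bound trails one EPSILON behind.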

The commit wait proceeds as follows:

1. The transaction starts and acquires the required locks.
2. We pick a timestamp s by calling TrueTime.
3. s is an upper bound on the actual time at which the locks were acquired.
4. We wait until the uncertainty interval has passed.
5. Then we release the locks.
6. The total wait is called the commit wait.
7. The length of the commit wait is approximately two times the average ε.

The monotonicity invariant allows Spanner to reliably decide whether a replica's state is up-to-date enough to serve a read. Traditional databases, by contrast, rely on strict two-phase locking and single-version storage to guarantee external consistency: a "strong read" that requests the latest data acquires a read lock, preventing any updates to the data involved in the read.

Serving reads at a timestamp

Each replica in Spanner tracks a safe time, t_safe, the maximum timestamp at which it is up-to-date. A replica can serve a read at timestamp t whenever t <= t_safe. Spanner defines t_safe as follows:

t_safe = min(t_safe^Paxos, t_safe^TM)

where,

  • t_safe^Paxos is the safe time of the Paxos state machine: the timestamp of the highest-applied Paxos write. Since timestamps increase monotonically and writes are applied in order, no future write will be committed at or below t_safe^Paxos.

  • t_safe^TM is the safe time of the transaction manager. If there are zero prepared (but not yet committed) transactions, t_safe^TM is infinity. Otherwise, the state such transactions will affect is indeterminate: a participant replica does not yet know whether they will commit.

Note: For a participant, t_safe^TM refers to the replica leader's transaction manager, whose state the participant can infer from metadata carried on Paxos writes.

The commit protocol guarantees that every participant in a transaction knows a lower bound on a prepared transaction's timestamp.

As per the Spanner paper, for a group g and a transaction T_i, every participant leader assigns a prepare timestamp s_{i,g}^prepare to its prepare record, and the coordinator leader ensures that the transaction's commit timestamp is at least s_{i,g}^prepare for every participant group g. Therefore, for every replica in a group g, t_safe^TM = min_i(s_{i,g}^prepare) - 1 over all transactions T_i prepared at g.
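The safe-time computation can be expressed directly; `prepared_timestamps` is an illustrative stand-in for the set of prepare timestamps at group g.

```python
import math

def t_safe_tm(prepared_timestamps):
    # With zero prepared (but not committed) transactions, t_safe^TM is
    # infinity; otherwise it is the minimum prepare timestamp minus 1.
    if not prepared_timestamps:
        return math.inf
    return min(prepared_timestamps) - 1

def t_safe(t_safe_paxos, prepared_timestamps):
    return min(t_safe_paxos, t_safe_tm(prepared_timestamps))

def can_serve_read(t, t_safe_paxos, prepared_timestamps):
    # A replica may serve a read at t only if t <= t_safe.
    return t <= t_safe(t_safe_paxos, prepared_timestamps)
```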

Point to ponder

Question

We calculated the transaction manager's safe time as t_safe^TM = min_i(s_{i,g}^prepare) - 1. Why is 1 subtracted from the minimum prepare timestamp?

Answer

One explanation is that a read is effectively a snapshot read, a read from the past. Subtracting 1 ensures we never read at a timestamp at which a transaction may still be in progress (more specifically, in its prepare phase).

Assigning timestamps to read-only transactions

A snapshot read is a read from the past: the client supplies a timestamp, and the state of the object or row at that timestamp is returned. Read-only transactions are non-snapshot standalone reads. They are executed like snapshot reads, but since the client does not supply a timestamp, the system computes a timestamp s_read and fetches data as of s_read.

There are two stages of execution for a read-only transaction, as per the Spanner paper:

  • Assign a timestamp s_read

  • Execute the transaction's reads as snapshot reads at s_read

Snapshot reads can be served by any replica that is sufficiently up-to-date. TT.now() returns a TTinterval containing the earliest and latest timestamps, so at any time after the start of a transaction we can simply assign s_read = TT.now().latest. However, reads at this timestamp may block if t_safe hasn't advanced far enough. (Choosing a value for s_read may also advance s_max to preserve disjointness.) To minimize the chance of blocking, Spanner should instead choose the oldest timestamp that still preserves external consistency. In the next lesson, we will learn how to choose such a timestamp.
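The simple choice s_read = TT.now().latest, and the way it can block behind t_safe, can be sketched as follows; `pick_s_read`, `read_at`, and `current_t_safe` are illustrative assumptions, not Spanner APIs.

```python
import time

EPSILON = 0.007  # assumed clock uncertainty; illustrative only

def pick_s_read():
    # The simple choice: s_read = TT.now().latest.
    return time.time() + EPSILON

def read_only(read_at, current_t_safe):
    s_read = pick_s_read()
    # The read may block while the replica's t_safe lags behind s_read.
    while current_t_safe() < s_read:
        time.sleep(0.001)
    return read_at(s_read)
```

A larger s_read means a longer possible wait for t_safe, which is why an older (but still externally consistent) timestamp is preferable.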

In this lesson, we learned how Spanner performs concurrency control over read and write operations. We also saw how reads execute without locking, at a timestamp chosen by the system, and can be served only by sufficiently up-to-date replicas.
